Unknown word handling in structured text classification
Similar resources
Hybrid POS tagging with generalized unknown-word handling
This paper presents POSTAG as a statistical/rule-based hybrid part-of-speech (POS) tagging system with generalized unknown-word handling. The POSTAG integrates morphological analysis with statistical POS disambiguation and post rule-based error-correction. The error-correction rules are automatically learned from a tagged corpus and selectively correct standard HMM tagging errors. The morpho...
A Faster Structured-Tag Word-Classification Method
Several methods have been proposed for processing a corpus to induce a tagset for the sublanguage represented by the corpus. This paper examines a structured-tag word classification method introduced by McMahon (1994). Two major variations (non-random initial assignment of words to classes and moving multiple words in parallel) together provide robust non-random results with a speed increase of...
Flexible Text Segmentation with Structured Multilabel Classification
Many language processing tasks can be reduced to breaking the text into segments with prescribed properties. Such tasks include sentence splitting, tokenization, named-entity extraction, and chunking. We present a new model of text segmentation based on ideas from multilabel classification. Using this model, we can naturally represent segmentation problems involving overlapping and non-contiguo...
Database-Text Alignment via Structured Multilabel Classification
This paper addresses the task of aligning a database with a corresponding text. The goal is to link individual database entries with sentences that verbalize the same information. By providing explicit semantics-to-text links, these alignments can aid the training of natural language generation and information extraction systems. Beyond these pragmatic benefits, the alignment problem is appeali...
Influence of Word Normalization on Text Classification
In this paper we focus our attention on the comparison of various lemmatization and stemming algorithms, which are often used in natural language processing (NLP). Sometimes these two techniques are considered to be identical, but there is an important difference. Lemmatization is generally more useful, because it produces the basic word form which is required in many application areas (i.e....
Journal
Journal title: Journal of Physics: Conference Series
Year: 2021
ISSN: 1742-6588,1742-6596
DOI: 10.1088/1742-6596/1727/1/012017